How to run Whisper as ONNX?

I converted the Hugging Face Whisper model to ONNX with optimum-cli:

optimum-cli export onnx --model openai/whisper-small.en  whispersmallen

I got 4 onnx files:

decoder_model_merged.onnx
decoder_model.onnx
decoder_with_past_model.onnx
encoder_model.onnx

Now I want to write code that loads Whisper (as ONNX) and runs it on a 1.wav file.

  • How do I do it?
  • When using the HF Whisper model, I just run one model (not two separate models: encoder/decoder)

1. Install Required Libraries

pip install onnxruntime librosa transformers numpy

2. Preprocess Audio into a Log-Mel Spectrogram

import numpy as np
import librosa
from transformers import WhisperFeatureExtractor

# Load audio at Whisper's expected 16 kHz sampling rate
audio, sr = librosa.load("1.wav", sr=16000)
feature_extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-small.en")

# Convert to a log-mel spectrogram (padded/truncated to 30 s of audio)
inputs = feature_extractor(audio, sampling_rate=16000, return_tensors="np")
input_features = inputs["input_features"]  # shape: (1, 80, 3000)

3. Load the ONNX Encoder and Run It

import onnxruntime as ort

# Load encoder
encoder_sess = ort.InferenceSession("whispersmallen/encoder_model.onnx")

# Run encoder
encoder_outputs = encoder_sess.run(
    output_names=["last_hidden_state"],
    input_feed={"input_features": input_features}
)[0]
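
A quick shape check helps catch preprocessing mistakes early. For whisper-small the encoder output should be (1, 1500, 768): the 3000 mel frames are downsampled by a factor of 2 and the hidden size is 768 (these numbers come from the small model's configuration, not from the post above):

# Encoder output: (batch, n_frames // 2, hidden_size); for whisper-small: (1, 1500, 768)
print("encoder output shape:", encoder_outputs.shape)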

4. Autoregressive Decoding Loop

Whisper's decoder takes the tokens generated so far (decoder_input_ids) together with the encoder output (encoder_hidden_states) and predicts the next token, one step at a time. This walkthrough uses decoder_model.onnx, which recomputes attention over the whole token sequence at every step; decoder_with_past_model.onnx and decoder_model_merged.onnx accept cached key/values for faster generation, at the cost of extra bookkeeping.

from transformers import WhisperTokenizer

tokenizer = WhisperTokenizer.from_pretrained("openai/whisper-small.en")

decoder_sess = ort.InferenceSession("whispersmallen/decoder_model.onnx")

# Start with <|startoftranscript|>
decoder_input_ids = np.array([[tokenizer.convert_tokens_to_ids("<|startoftranscript|>")]], dtype=np.int64)

generated_ids = []

for _ in range(100):  # max 100 tokens
    outputs = decoder_sess.run(
        output_names=["logits"],
        input_feed={
            "input_ids": decoder_input_ids,
            "encoder_hidden_states": encoder_outputs
        }
    )
    
    next_token_logits = outputs[0][:, -1, :]  # shape: (1, vocab_size)
    next_token_id = np.argmax(next_token_logits, axis=-1)[0]
    
    if next_token_id == tokenizer.eos_token_id:
        break

    generated_ids.append(next_token_id)
    decoder_input_ids = np.append(decoder_input_ids, [[next_token_id]], axis=-1)

5. Decode the Output

transcription = tokenizer.decode(generated_ids, skip_special_tokens=True)
print("Transcription:", transcription)

Summary

  • With the raw ONNX export, you handle the encoder and decoder explicitly.
  • The exported ONNX files do not wrap both parts into a single model (see the sketch after this list for an optimum wrapper that does).
  • The decoder loop is autoregressive: each generated token is fed back as input at the next step.
  • Pre- and post-processing can still use the Hugging Face feature extractor and tokenizer.
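
If you would rather keep the one-model workflow you had with the plain HF model, the optimum library can load the same export folder behind a single generate() call, so the decoding loop above is handled for you. A minimal sketch, assuming optimum[onnxruntime] is installed and whispersmallen is the folder created by optimum-cli:

import librosa
from optimum.onnxruntime import ORTModelForSpeechSeq2Seq
from transformers import WhisperProcessor

# Wraps the exported encoder/decoder ONNX files and runs the autoregressive loop internally
model = ORTModelForSpeechSeq2Seq.from_pretrained("whispersmallen")
processor = WhisperProcessor.from_pretrained("openai/whisper-small.en")

audio, _ = librosa.load("1.wav", sr=16000)
input_features = processor(audio, sampling_rate=16000, return_tensors="pt").input_features

generated_ids = model.generate(input_features)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])

generate() uses the model's default decoding settings; the hand-rolled loop above is still useful when you want to see exactly what the decoder consumes.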

ChatGPT provided this, and I tested it — it works.
